short description of the study and of the specific data and replication analyses you will be performing, to orient your reader. Outline (briefly) the goal of the original paper, the data set used, and the analyses conducted, then describe which you will replicate. You should also demonstrate how you read your datafile into R, and show a few lines of raw data in your output (e.g., using head()).
Install {moments}. Load the following packages:
> library(curl)
## Using libcurl 7.64.1 with LibreSSL/2.8.3
> library(ggplot2)
> library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.5 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x readr::parse_date() masks curl::parse_date()
> library(dplyr)
> library(moments)
Figure 1: Density plots of Tb of 10 Kalahari lizard species in summer: distributions are centred on modal temperature of each species. Left skewness is visually evident for many species and significant in seven (Table S1)
“Because these analyses evaluate an explicit (a priori) hypothesis that Tb distributions are left-skewed, we used one-tailed tests.”
> f <- curl("https://raw.githubusercontent.com/maekell98/nkelley-data-replication-assignment/main/doi_10-5/KalahariTbData.csv")
> kl <- read.csv(f, header = TRUE, sep = ",", stringsAsFactors = FALSE)
> head(kl)
## area number latitude species Tb time date family
## 1 L 15065 -26.38 Agama aculeata 35.2 12:25 11/28/69 Agamdidae
## 2 L 15087 -26.38 Agama aculeata 38.0 16:28 11/28/69 Agamdidae
## 3 K 15092 -25.75 Agama aculeata 37.2 10:57 11/29/69 Agamdidae
## 4 L 15123 -26.38 Agama aculeata 35.6 17:40 11/28/69 Agamdidae
## 5 K 15141 -25.75 Agama aculeata 37.0 12:50 11/30/69 Agamdidae
## 6 K 15144 -25.75 Agama aculeata 38.0 15:30 11/30/69 Agamdidae
Note: Southern latitude summer = December through February. Therefore, we must subset the data by date to include only data collected in the summer months.
“We arbitrarily set a minimum sample size of N = 23 Tb for inclusion.” Since Trachylepis variegata does not have this many samples, it will be excluded from further analysis at the end of the the following chunk.
> #in order to subset by date, dates must be coerced into a format recognized by r:
> #date formatting: https://stackoverflow.com/questions/38146160/how-to-convert-dd-mm-yy-to-yyyy-mm-dd-in-r
> kl$date <- as.Date(kl$date, "%m/%d/%Y")
>
> #kls = kalahari lizards summer subset
> #date subsetting: https://www.statology.org/subset-by-date-range-in-r/
> kld <- kl[kl$date >= "0069-12-02" & kl$date <= "0070-02-27", ]
>
> #not enough data for species Trachylepis spilogaster, deleting for further analyis
> kls <- subset(kld, species!="Trachylepis variegata")
> head(kls)
## area number latitude species Tb time date family
## 7 <NA> 15166 -27.07 Agama aculeata 38.4 14:18 0069-12-02 Agamdidae
## 9 <NA> 15192 -27.07 Agama aculeata 40.4 14:33 0069-12-02 Agamdidae
## 10 <NA> 15194 -27.07 Agama aculeata 36.2 14:54 0069-12-02 Agamdidae
## 11 <NA> 15233 -26.85 Agama aculeata 37.6 11:00 0069-12-04 Agamdidae
## 12 <NA> 15241 -26.80 Agama aculeata 37.8 8:50 0069-12-05 Agamdidae
## 13 <NA> 15242 -26.80 Agama aculeata 34.8 8:55 0069-12-05 Agamdidae
In order to obtain density plots by species, we first subset by species:
> #used unique to quickly get a list of species in the kl data set
> unique(kld$species)
## [1] "Agama aculeata" "Heliobolus lugubris"
## [3] "Meroles squamulosus" "Meroles suborbitalis"
## [5] "Nucras tessellata" "Pedioplanis namaquensis"
## [7] "Pedioplanis_lineoocellata" "Trachylepis occidentalis"
## [9] "Trachylepis sparsa" "Trachylepis spilogaster"
## [11] "Trachylepis variegata"
> #define function for calculating mode: https://r-lang.com/mode-in-r/
> getmode <- function(v) {
+ uniqv <- unique(v)
+ uniqv[which.max(tabulate(match(v, uniqv)))]
+ }
>
> #now we create a subset for each species in which the mode of each Tb is adjusted to 0 and represented by column Tbm
> ss1 <- subset(kls, species == "Agama aculeata") %>%
+ mutate(Tbm = Tb - getmode(Tb))
>
> ss2 <- subset(kls, species == "Heliobolus lugubris") %>%
+ mutate(Tbm = Tb - getmode(Tb))
>
> ss3 <- subset(kls, species == "Meroles squamulosus") %>%
+ mutate(Tbm = Tb - getmode(Tb))
>
> ss4 <- subset(kls, species == "Meroles suborbitalis") %>%
+ mutate(Tbm = Tb - getmode(Tb))
>
> ss5 <- subset(kls, species == "Nucras tessellata") %>%
+ mutate(Tbm = Tb - getmode(Tb))
>
> ss6 <- subset(kls, species == "Pedioplanis lineoocellata") %>%
+ mutate(Tbm = Tb - getmode(Tb))
>
> ss7 <- subset(kls, species == "Pedioplanis namaquensis")%>% mutate(Tbm = Tb - getmode(Tb))
>
> ss8 <- subset(kls, species == "Trachylepis occidentalis") %>% mutate(Tbm = Tb - getmode(Tb))
>
> ss9 <- subset(kls, species == "Trachylepis sparsa") %>%
+ mutate(Tbm = Tb - getmode(Tb))
>
> ss10 <- subset(kls, species == "Trachylepis spilogaster") %>% mutate(Tbm = Tb - getmode(Tb))
Now we can run the ggplot. In order to show all the species subsets on the same plot, we simply add geom_density() with each of the different subsets.
> #constructing density plots: http://www.sthda.com/english/wiki/ggplot2-density-plot-quick-start-guide-r-software-and-data-visualization
> s <- ggplot(ss1, aes(x=Tbm)) +
+ geom_density() +
+ geom_density(data = ss2) +
+ geom_density(data = ss3) +
+ geom_density(data = ss4) +
+ geom_density(data = ss5) +
+ geom_density(data = ss6) +
+ geom_density(data = ss7) +
+ geom_density(data = ss8) +
+ geom_density(data = ss9) +
+ geom_density(data = ss10) +
+ #add mean line:
+ geom_vline(aes(xintercept=0),
+ color="black", linetype="dashed") +
+ #classic theme seems most appropriate (removes grid background, adds tick marks, black and white)
+ theme_classic() +
+ #set axes:
+ xlim(-15, 5) + ylim(0, 0.5)+
+ #set tick marks:
+ scale_y_continuous(breaks = c(0.0, 0.1, 0.2))+
+ scale_x_continuous(breaks = c(-15, -10, -5, 0, 5))+
+ #add labels,
+ #code for degrees celsius symbol: https://stackoverflow.com/questions/51799118/writing-the-symbol-degrees-celsius-in-axis-titles-with-r-plotly/51799161
+ labs(title="Density plots Kalahari lizards, summer",x="Body temperature (\u00B0C), centered on modal temperature", y = "Density")
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
> s
We can see from the figure that most Kalahari lizard species blood temperature is exhibits left skewness.
For comparison, we have the origial plot:
Figure 3: Body temperature skewness (jittered) for desert lizards, with values by family (or by subfamily for Agamidae) and by continent. Three families are represented by one species. No pattern (desert, taxon) is evident.
> f <- curl("https://raw.githubusercontent.com/maekell98/nkelley-data-replication-assignment/main/doi_10-5/KalahariTbData.csv")
> kl <- read.csv(f, header = TRUE, sep = ",", stringsAsFactors = FALSE)
> kl['desert']='Africa'
> head(kl)
## area number latitude species Tb time date family desert
## 1 L 15065 -26.38 Agama aculeata 35.2 12:25 11/28/69 Agamdidae Africa
## 2 L 15087 -26.38 Agama aculeata 38.0 16:28 11/28/69 Agamdidae Africa
## 3 K 15092 -25.75 Agama aculeata 37.2 10:57 11/29/69 Agamdidae Africa
## 4 L 15123 -26.38 Agama aculeata 35.6 17:40 11/28/69 Agamdidae Africa
## 5 K 15141 -25.75 Agama aculeata 37.0 12:50 11/30/69 Agamdidae Africa
## 6 K 15144 -25.75 Agama aculeata 38.0 15:30 11/30/69 Agamdidae Africa
> #Basic R script for calculating stats describing Tb distributions:
> klskew <- kl %>%
+ group_by(family) %>%
+ summarise(
+ meanTb = round(mean(Tb, na.rm = TRUE), digits = 2),
+ medTb = median(Tb,na.rm = TRUE),
+ MAD = round(mad(Tb, na.rm = TRUE), digits = 2),
+ minTb = min(Tb, na.rm = TRUE),
+ maxBT = max(Tb, na.rm = TRUE),
+ nBT = n(),
+ SkewCoef = agostino.test(Tb, alternative = "greater")$statistic[1],
+ SkewP = agostino.test(Tb, alternative = "greater")$p.value,
+ desert = unique(desert))
> g <- curl("https://raw.githubusercontent.com/maekell98/nkelley-data-replication-assignment/main/doi_10-5/AustraliaTbData.csv")
> al<- read.csv(g, header = TRUE, sep = ",", stringsAsFactors = FALSE)
> al['desert']='Australia'
> names(al)[4]<-paste("Tb")
> head(al)
## area number species Tb time date microhabitat
## 1 E 10448 Varanus eremius 41.0 14:35 01/20/66 terrestrial
## 2 <NA> 9828 Ctenophorus reticulatus 31.8 15:15 10/03/66 semi-arboreal
## 3 <NA> 9830 Ctenophorus reticulatus 31.6 15:24 10/03/66 semi-arboreal
## 4 <NA> 9832 Lophognathus longirostris 37.6 11:15 10/04/66 arboreal
## 5 <NA> 9834 Ctenophorus caudicinctus 38.0 14:00 10/05/66 saxicolous
## 6 <NA> 9835 Ctenophorus isolepis 39.2 16:00 10/05/66 terrestrial
## family desert
## 1 Varanidae Australia
## 2 Agamidae Australia
## 3 Agamidae Australia
## 4 Agamidae Australia
## 5 Agamidae Australia
## 6 Agamidae Australia
> alskew <- al %>%
+ group_by(family) %>%
+ summarise(
+ meanTb = round(mean(Tb, na.rm = TRUE), digits = 2),
+ medTb = median(Tb,na.rm = TRUE),
+ MAD = round(mad(Tb, na.rm = TRUE), digits = 2),
+ minTb = min(Tb, na.rm = TRUE),
+ maxBT = max(Tb, na.rm = TRUE),
+ nBT = n(),
+ SkewCoef = agostino.test(Tb, alternative = "greater")$statistic[1],
+ SkewP = agostino.test(Tb, alternative = "greater")$p.value,
+ desert = unique(desert))
> h <- curl("https://raw.githubusercontent.com/maekell98/nkelley-data-replication-assignment/main/doi_10-5/NorthAmericanTbData.csv")
> nl <- read.csv(h, header = TRUE, sep = ",", stringsAsFactors = FALSE)
> nl['desert']='North Amer.'
> head(nl)
## site number latitude NS species Tb time date
## 1 NAN 5169 42.2 north Aspidoscelis tigris 38.2 7:17 07/01/62
## 2 NAN 5170 42.2 north Phrynosoma platyrhinos 32.0 7:20 07/01/62
## 3 NAN 5171 42.2 north Aspidoscelis tigris 38.0 7:29 07/01/62
## 4 NAN 5172 42.2 north Uta stansburiana 36.0 7:34 07/01/62
## 5 NAN 5173 42.2 north Uta stansburiana 33.2 7:43 07/01/62
## 6 NAN 5174 42.2 north Uta stansburiana 35.4 7:44 07/01/62
## family desert
## 1 Teiidae North Amer.
## 2 Phrynosomatidae North Amer.
## 3 Teiidae North Amer.
## 4 Phrynosomatidae North Amer.
## 5 Phrynosomatidae North Amer.
## 6 Phrynosomatidae North Amer.
> nlskew <- nl %>%
+ group_by(family) %>%
+ summarise(
+ meanTb = round(mean(Tb, na.rm = TRUE), digits = 2),
+ medTb = median(Tb,na.rm = TRUE),
+ MAD = round(mad(Tb, na.rm = TRUE), digits = 2),
+ minTb = min(Tb, na.rm = TRUE),
+ maxBT = max(Tb, na.rm = TRUE),
+ nBT = n(),
+ SkewCoef = agostino.test(Tb, alternative = "greater")$statistic[1],
+ SkewP = agostino.test(Tb, alternative = "greater")$p.value,
+ desert = unique(desert))
> #bind data -- retain all columns but combine columns in common; NA for columns not in common
> #https://www.geeksforgeeks.org/combine-two-dataframes-in-r-with-different-columns/
> ld <- bind_rows(kl, nl, al)
> #oop. Maybe I didn't need this.
>
> #combine skew summaries:
> lskew <- bind_rows(nlskew, alskew, klskew)
> #make a dot plot: http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html
> lplot <-
+ ggplot(lskew, aes(x=SkewCoef, y=family, group=desert)) +
+ #specify shape/color of points: http://www.sthda.com/english/wiki/ggplot2-point-shapes
+ geom_point(aes(shape=desert, color=desert), size = 3) +
+ scale_shape_manual(values=c(25, 8, 17))+
+ scale_color_manual(values=c('red', 'blue', 'grey48'))+
+ scale_fill_manual(values='red')+
+ labs(x = "Skewness", y = "Families or subfamilies") +
+ theme_classic() +
+ xlim(-2.0, 0.5) +
+ scale_x_continuous(breaks = c(-2.0, -1.5, -1.0, -0.5, 0.0, 0.5)) +
+ geom_vline(aes(xintercept=0),
+ color="black", linetype="dashed") +
+ xlim(-2.0, 0.5)
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
> lplot
For comparison, we have the original plot:
You can see my plot did not include the phylogenetic comparison. It also contains different families compared to the original plot. I was unable to find the families used in this plot when I was going through the data. I am not sure if they used other outside information to further classify families/subfamilies, but the lack of clarification made it impossible for me to truly replicate their figure.